Authors: Jinjing Zhu, Haotian Bai, Lin Wang
Endeavors have been recently made to leverage the vision transformer (ViT) for the challenging unsupervised domain adaptation (UDA) task. They typically adopt the cross-attention in ViT for direct domain alignment. However, as the performance of cross-attention highly relies on the quality of pseudo labels for target samples, it becomes less effective when the domain gap becomes large. We solve this problem from a game theory's perspective with the proposed model dubbed PMTrans, which bridges the source and target domains with an intermediate domain. Specifically, we propose a novel ViT-based module called PatchMix that effectively builds up the intermediate domain, i.e., a probability distribution, by learning to sample patches from both domains based on game-theoretical models. This way, it learns to mix patches from the source and target domains to maximize the cross entropy (CE), while exploiting two semi-supervised mixup losses in the feature and label spaces to minimize it. As such, we interpret the process of UDA as a min-max CE game with three players, including the feature extractor, the classifier, and PatchMix, to find the Nash Equilibria. Moreover, we leverage attention maps from ViT to re-weight the label of each patch by its importance, making it possible to obtain more domain-discriminative feature representations. We conduct extensive experiments on four benchmark datasets, and the results show that PMTrans significantly surpasses the ViT-based and CNN-based SoTA methods by +3.6% on Office-Home, +1.4% on Office-31, and +17.7% on DomainNet, respectively.
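To make the patch-mixing idea concrete, here is a minimal NumPy sketch of the kind of operation the abstract describes: an intermediate-domain sample is built by drawing a mixing ratio from a Beta distribution and selecting each patch from either the source or the target image, with the mixed label weighted by the actual fraction of source patches used. Note this is an illustrative assumption, not the paper's implementation; the names `patchmix`, `alpha`, and the Bernoulli sampling scheme are hypothetical, and PMTrans additionally learns the sampling distribution and re-weights patch labels via ViT attention maps.

```python
import numpy as np

def patchmix(src_patches, tgt_patches, src_label, tgt_label, alpha=1.0, rng=None):
    """Hypothetical sketch of PatchMix-style mixing.

    src_patches, tgt_patches: arrays of shape (num_patches, patch_dim)
    src_label, tgt_label:     one-hot (or soft) label vectors
    """
    rng = np.random.default_rng() if rng is None else rng
    n = src_patches.shape[0]
    lam = rng.beta(alpha, alpha)            # global mixing ratio from Beta(alpha, alpha)
    mask = rng.random(n) < lam              # per-patch choice: True -> take source patch
    mixed = np.where(mask[:, None], src_patches, tgt_patches)
    lam_eff = mask.mean()                   # actual source fraction realized by sampling
    mixed_label = lam_eff * src_label + (1.0 - lam_eff) * tgt_label
    return mixed, mixed_label, lam_eff
```

In the paper's min-max CE game, such mixed samples would be the moves of the PatchMix player, while the feature extractor and classifier are trained against them with the semi-supervised mixup losses.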
Paper link: http://arxiv.org/pdf/2303.13434v1
More computer science papers: http://cspaper.cn/